NYC TAXI

The competition dataset is based on the 2016 NYC Yellow Cab trip record data made available in Big Query on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). The data was sampled and cleaned for the purposes of this playground competition. Based on individual trip attributes, participants should predict the duration of each trip in the test set.

File descriptions

  • train.csv - the training set (contains 1458644 trip records)
  • test.csv - the testing set (contains 625134 trip records)
  • sample_submission.csv - a sample submission file in the correct format

Data fields

  • id - a unique identifier for each trip
  • vendor_id - a code indicating the provider associated with the trip record
  • pickup_datetime - date and time when the meter was engaged
  • dropoff_datetime - date and time when the meter was disengaged
  • passenger_count - the number of passengers in the vehicle (driver entered value)
  • pickup_longitude - the longitude where the meter was engaged
  • pickup_latitude - the latitude where the meter was engaged
  • dropoff_longitude - the longitude where the meter was disengaged
  • dropoff_latitude - the latitude where the meter was disengaged
  • store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server: Y=store and forward; N=not a store and forward trip.
  • trip_duration - duration of the trip in seconds

Disclaimer: The decision was made to not remove dropoff coordinates from the dataset order to provide an expanded set of variables to use in Kernels.

In [1]:
import pandas as pd  #pandas for using dataframe and reading csv 
import numpy as np   #numpy for vector operations and basic maths 
#import simplejson    #getting JSON in simplified format
import urllib        #for url stuff
#import gmaps       #for using google maps to visulalize places on maps
import re            #for processing regular expressions
import datetime      #for datetime operations
import calendar      #for calendar for datetime operations
import matplotlib.pyplot as plt # for plotting basic graphs and pie charts
import time          #to get the system time
import scipy         #for other dependancies
from sklearn.cluster import KMeans # for doing K-means clustering
from haversine import haversine # for calculating haversine distance
import math          #for basic maths operations
import seaborn as sns #for making plots
import matplotlib.pyplot as plt # for plotting
import os  # for os commands
from scipy.misc import imread, imresize, imsave  # for plots 
import plotly.plotly as py
import plotly.graph_objs as go
import plotly
plotly.offline.init_notebook_mode() # run at the start of every ipython notebook
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
In [2]:
# Creating the INPUT FOLDER which will contain our input files!
INPUT_FOLDER='/Users/as186194/Documents/DOCUMENTS/TRIALS/Kaggle/Kaggle_NYC_Taxi/'
print ('File Sizes:')
for f in os.listdir(INPUT_FOLDER):
    if 'zip' not in f:
       print (f.ljust(30) + str(round(os.path.getsize(INPUT_FOLDER +  f) / 1000000, 2)) + ' MB')
File Sizes:
.DS_Store                     0.01 MB
model_df.csv                  118.12 MB
NYC_Taxi.ipynb                0.46 MB
NYC_Taxi.pptx                 0.61 MB
test.csv                      70.79 MB
train.csv                     200.59 MB
In [3]:
train_df=pd.read_csv(INPUT_FOLDER + 'train.csv')
In [115]:
test_df=pd.read_csv(INPUT_FOLDER + 'test.csv')
In [5]:
train_df.head()
Out[5]:
id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N 455
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N 663
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N 2124
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N 429
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N 435
In [6]:
train_df.shape
Out[6]:
(1458644, 11)
In [7]:
train_df.columns
Out[7]:
Index(['id', 'vendor_id', 'pickup_datetime', 'dropoff_datetime',
       'passenger_count', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'store_and_fwd_flag',
       'trip_duration'],
      dtype='object')
In [8]:
test_df.head()
Out[8]:
id vendor_id pickup_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag
0 id3004672 1 2016-06-30 23:59:58 1 -73.988129 40.732029 -73.990173 40.756680 N
1 id3505355 1 2016-06-30 23:59:53 1 -73.964203 40.679993 -73.959808 40.655403 N
2 id1217141 1 2016-06-30 23:59:47 1 -73.997437 40.737583 -73.986160 40.729523 N
3 id2150126 2 2016-06-30 23:59:41 1 -73.956070 40.771900 -73.986427 40.730469 N
4 id1598245 1 2016-06-30 23:59:33 1 -73.970215 40.761475 -73.961510 40.755890 N
In [9]:
test_df.shape
Out[9]:
(625134, 9)
In [10]:
test_df.columns
Out[10]:
Index(['id', 'vendor_id', 'pickup_datetime', 'passenger_count',
       'pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
       'dropoff_latitude', 'store_and_fwd_flag'],
      dtype='object')
In [11]:
vendor=train_df.groupby("vendor_id").size()

colors = ['gold', 'lightskyblue']
plt.pie(vendor, shadow = True, colors = colors, labels = ['Vendor 1', 'Vendor 2'],
        autopct='%1.1f%%')
plt.title('Cab Vendors')
Out[11]:
<matplotlib.text.Text at 0x136c5a6a0>
In [12]:
duration=train_df.groupby('trip_duration').size()
In [13]:
duration.hist(bins=50,figsize=(10,8))
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ae0d978>
In [14]:
duration.plot(kind='bar')
plt.show()
In [15]:
sns.set(style="white", palette="muted", color_codes=True)
f, axes = plt.subplots(figsize=(11, 7), sharex=True)
sns.despine(left=True)
sns.distplot((train_df['trip_duration'].values+1), 
axlabel = 'trip_duration', label = 'trip_duration', bins=50)
plt.setp(axes, yticks=[])
plt.tight_layout()
In [16]:
sns.set(style="white", palette="muted", color_codes=True)
f, axes = plt.subplots(1, 1, figsize=(11, 7), sharex=True)
sns.despine(left=True)
sns.distplot(np.log(train_df['trip_duration'].values+1),
             axlabel = 'Log(trip_duration)', label = 'log(trip_duration)', bins = 50, color="g")
plt.setp(axes, yticks=[])
plt.tight_layout()
end = time.time()

Findings - The above histrogram and kernel density plot that the trip-durations are like gaussian and few trips have very large duration, like ~350000 seconds which is 100 hours, while most of the trips are:

  • e^4 = 1 minute to
  • e^8 ~ 60 minutes

Probably trips are taken inside Manhattan or in New York only.

In [17]:
sns.set(style="white", palette="muted", color_codes=True)
f, axes = plt.subplots(2,2,figsize=(10, 10), sharex=False, sharey = False)
sns.despine(left=True)
sns.distplot(train_df['pickup_latitude'].values, label = 'pickup_latitude',color="m",bins = 100, ax=axes[0,0])
sns.distplot(train_df['pickup_longitude'].values, label = 'pickup_longitude',color="m",bins =100, ax=axes[0,1])
sns.distplot(train_df['dropoff_latitude'].values, label = 'dropoff_latitude',color="m",bins =100, ax=axes[1, 0])
sns.distplot(train_df['dropoff_longitude'].values, label = 'dropoff_longitude',color="m",bins =100, ax=axes[1, 1])
plt.setp(axes, yticks=[])
plt.tight_layout()
plt.show()

Findings - From the plot above it is clear that pick and drop latitude are centered around 40 to 41, and longitude are situated around -74 to -73.

We are not getting any histogram kind of plots when we are plotting lat - long as the distplot frunction of sns is getting affacted by outliers, trips which are very far from each other like lat 32 to lat 44, are taking very long time, and affacted this plot such that it is coming of as a spike.

Let's remove those large duration trip by using a cap on lat-long and visulaize the distributions of latitude and longitude given to us.

In [18]:
df = train_df.loc[(train_df.pickup_latitude > 40.6) & (train_df.pickup_latitude < 40.9)]
df = df.loc[(df.dropoff_latitude>40.6) & (df.dropoff_latitude < 40.9)]
df = df.loc[(df.dropoff_longitude > -74.05) & (df.dropoff_longitude < -73.7)]
df = df.loc[(df.pickup_longitude > -74.05) & (df.pickup_longitude < -73.7)]
train_data = df.copy()
sns.set(style="white", palette="muted", color_codes=True)
f, axes = plt.subplots(2,2,figsize=(12, 12), sharex=False, sharey = False)#
sns.despine(left=True)

sns.distplot(train_data['pickup_latitude'].values, label = 'pickup_latitude',color="m",bins = 100, ax=axes[0,0])
sns.distplot(train_data['pickup_longitude'].values, label = 'pickup_longitude',color="g",bins =100, ax=axes[0,1])
sns.distplot(train_data['dropoff_latitude'].values, label = 'dropoff_latitude',color="m",bins =100, ax=axes[1, 0])
sns.distplot(train_data['dropoff_longitude'].values, label = 'dropoff_longitude',color="g",bins =100, ax=axes[1, 1])
plt.setp(axes, yticks=[])
plt.tight_layout()


print(df.shape[0], train_data.shape[0])
plt.show()
1452385 1452385

As we put the following caps on lat-long -

  • latitude should be between 40.6 to 40.9
  • longitude should be between -74.05 to -73.70

We get that the distribution spikes becomes as distribution in distplot (distplot is a histrogram plot in seaborn package), we can see that most of the trips are getting concentrated between these lat-long only.

In [19]:
test_data=test_df.copy()
In [20]:
train_data['pickup_datetime'] = pd.to_datetime(train_data.pickup_datetime)
train_data.loc[:, 'pick_date'] = train_data['pickup_datetime'].dt.date
train_data.head()
Out[20]:
id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration pick_date
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N 455 2016-03-14
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N 663 2016-06-12
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N 2124 2016-01-19
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N 429 2016-04-06
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N 435 2016-03-26
In [21]:
test_data['pickup_datetime'] = pd.to_datetime(test_data.pickup_datetime)
test_data.loc[:, 'pick_date'] = test_data['pickup_datetime'].dt.date
test_data.head()
Out[21]:
id vendor_id pickup_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag pick_date
0 id3004672 1 2016-06-30 23:59:58 1 -73.988129 40.732029 -73.990173 40.756680 N 2016-06-30
1 id3505355 1 2016-06-30 23:59:53 1 -73.964203 40.679993 -73.959808 40.655403 N 2016-06-30
2 id1217141 1 2016-06-30 23:59:47 1 -73.997437 40.737583 -73.986160 40.729523 N 2016-06-30
3 id2150126 2 2016-06-30 23:59:41 1 -73.956070 40.771900 -73.986427 40.730469 N 2016-06-30
4 id1598245 1 2016-06-30 23:59:33 1 -73.970215 40.761475 -73.961510 40.755890 N 2016-06-30
In [22]:
train_data['pickup_datetime'] = pd.to_datetime(train_data.pickup_datetime)
train_data.loc[:, 'pick_month'] = train_data['pickup_datetime'].dt.month
train_data.loc[:, 'hour'] = train_data['pickup_datetime'].dt.hour
train_data.loc[:, 'week_of_year'] = train_data['pickup_datetime'].dt.weekofyear
train_data.loc[:, 'day_of_year'] = train_data['pickup_datetime'].dt.dayofyear
train_data.loc[:, 'day_of_week'] = train_data['pickup_datetime'].dt.dayofweek
In [23]:
train_data.head()
Out[23]:
id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration pick_date pick_month hour week_of_year day_of_year day_of_week
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N 455 2016-03-14 3 17 11 74 0
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N 663 2016-06-12 6 0 23 164 6
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N 2124 2016-01-19 1 11 3 19 1
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N 429 2016-04-06 4 19 14 97 2
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N 435 2016-03-26 3 13 12 86 5
In [24]:
train_data['dropoff_datetime'] = pd.to_datetime(train_data.dropoff_datetime)
train_data.loc[:, 'drop_month'] = train_data['dropoff_datetime'].dt.month
train_data.loc[:, 'drop_hour'] = train_data['dropoff_datetime'].dt.hour
train_data.loc[:, 'drop_week_of_year'] = train_data['dropoff_datetime'].dt.weekofyear
train_data.loc[:, 'drop_day_of_year'] = train_data['dropoff_datetime'].dt.dayofyear
train_data.loc[:, 'drop_day_of_week'] = train_data['dropoff_datetime'].dt.dayofweek
In [25]:
test_data['pickup_datetime'] = pd.to_datetime(test_data.pickup_datetime)
test_data.loc[:, 'pick_month'] = test_data['pickup_datetime'].dt.month
test_data.loc[:, 'hour'] = test_data['pickup_datetime'].dt.hour
test_data.loc[:, 'week_of_year'] = test_data['pickup_datetime'].dt.weekofyear
test_data.loc[:, 'day_of_year'] = test_data['pickup_datetime'].dt.dayofyear
test_data.loc[:, 'day_of_week'] = test_data['pickup_datetime'].dt.dayofweek
In [26]:
test_data.head()
Out[26]:
id vendor_id pickup_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag pick_date pick_month hour week_of_year day_of_year day_of_week
0 id3004672 1 2016-06-30 23:59:58 1 -73.988129 40.732029 -73.990173 40.756680 N 2016-06-30 6 23 26 182 3
1 id3505355 1 2016-06-30 23:59:53 1 -73.964203 40.679993 -73.959808 40.655403 N 2016-06-30 6 23 26 182 3
2 id1217141 1 2016-06-30 23:59:47 1 -73.997437 40.737583 -73.986160 40.729523 N 2016-06-30 6 23 26 182 3
3 id2150126 2 2016-06-30 23:59:41 1 -73.956070 40.771900 -73.986427 40.730469 N 2016-06-30 6 23 26 182 3
4 id1598245 1 2016-06-30 23:59:33 1 -73.970215 40.761475 -73.961510 40.755890 N 2016-06-30 6 23 26 182 3
In [27]:
'''test_data['dropoff_datetime'] = pd.to_datetime(test_data.dropoff_datetime)
test_data.loc[:, 'drop_month'] = test_data['dropoff_datetime'].dt.month
test_data.loc[:, 'drop_hour'] = test_data['dropoff_datetime'].dt.hour
test_data.loc[:, 'drop_week_of_year'] = test_data['dropoff_datetime'].dt.weekofyear
test_data.loc[:, 'drop_day_of_year'] = test_data['dropoff_datetime'].dt.dayofyear
test_data.loc[:, 'drop_day_of_week'] = test_data['dropoff_datetime'].dt.dayofweek'''
Out[27]:
"test_data['dropoff_datetime'] = pd.to_datetime(test_data.dropoff_datetime)\ntest_data.loc[:, 'drop_month'] = test_data['dropoff_datetime'].dt.month\ntest_data.loc[:, 'drop_hour'] = test_data['dropoff_datetime'].dt.hour\ntest_data.loc[:, 'drop_week_of_year'] = test_data['dropoff_datetime'].dt.weekofyear\ntest_data.loc[:, 'drop_day_of_year'] = test_data['dropoff_datetime'].dt.dayofyear\ntest_data.loc[:, 'drop_day_of_week'] = test_data['dropoff_datetime'].dt.dayofweek"
In [28]:
feature_sel_data=train_data.copy()
In [29]:
feature_sel_data=feature_sel_data.drop(["pickup_datetime","dropoff_datetime","pick_date","id"], axis=1)
In [30]:
feature_sel_data["store_and_fwd_flag"].replace(['N','Y'],[0,1],inplace=True)
In [31]:
feature_sel_data.head()
Out[31]:
vendor_id passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration pick_month hour week_of_year day_of_year day_of_week drop_month drop_hour drop_week_of_year drop_day_of_year drop_day_of_week
0 2 1 -73.982155 40.767937 -73.964630 40.765602 0 455 3 17 11 74 0 3 17 11 74 0
1 1 1 -73.980415 40.738564 -73.999481 40.731152 0 663 6 0 23 164 6 6 0 23 164 6
2 2 1 -73.979027 40.763939 -74.005333 40.710087 0 2124 1 11 3 19 1 1 12 3 19 1
3 2 1 -74.010040 40.719971 -74.012268 40.706718 0 429 4 19 14 97 2 4 19 14 97 2
4 2 1 -73.973053 40.793209 -73.972923 40.782520 0 435 3 13 12 86 5 3 13 12 86 5
In [32]:
trip_d=feature_sel_data['trip_duration']
In [33]:
feature_sel_data=feature_sel_data.drop(('trip_duration'), axis=1)
In [34]:
feature_sel_data["trip_duration"]=trip_d
In [35]:
#len(feature_sel_data.columns)
In [36]:
#USING RFE
from sklearn.linear_model import RandomizedLasso
A = feature_sel_data.iloc[:,0:17]
B = feature_sel_data["trip_duration"]
column_names=feature_sel_data.columns


rlasso = RandomizedLasso(alpha=0.025)
rlasso.fit(A,B)
Out[36]:
RandomizedLasso(alpha=0.025, eps=2.2204460492503131e-16, fit_intercept=True,
        max_iter=500, memory=Memory(cachedir=None), n_jobs=1,
        n_resampling=200, normalize=True, pre_dispatch='3*n_jobs',
        precompute='auto', random_state=None, sample_fraction=0.75,
        scaling=0.5, selection_threshold=0.25, verbose=False)
In [37]:
print ("Features sorted by their score:")
print (sorted(zip(map(lambda x: round(x, 4), rlasso.scores_), 
                 column_names), reverse=True))
Features sorted by their score:
[(1.0, 'vendor_id'), (1.0, 'pickup_longitude'), (1.0, 'pickup_latitude'), (1.0, 'dropoff_longitude'), (1.0, 'dropoff_latitude'), (0.48999999999999999, 'drop_day_of_year'), (0.25, 'drop_month'), (0.115, 'passenger_count'), (0.055, 'pick_month'), (0.055, 'day_of_year'), (0.0050000000000000001, 'hour'), (0.0, 'week_of_year'), (0.0, 'store_and_fwd_flag'), (0.0, 'drop_week_of_year'), (0.0, 'drop_hour'), (0.0, 'drop_day_of_week'), (0.0, 'day_of_week')]

USING RFE WITH LINEAR REGRESSION AND RANKING IT

In [38]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
In [39]:
#use linear regression as the model
lr = LinearRegression()
In [40]:
A = feature_sel_data.iloc[:,0:17]
B = feature_sel_data["trip_duration"]
column_names=feature_sel_data.columns

#rank all features, i.e continue the elimination until the last one
rfe = RFE(lr, n_features_to_select=None)
rfe.fit(A,B)
Out[40]:
RFE(estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
  n_features_to_select=None, step=1, verbose=0)
In [41]:
print ("Features sorted by their rank:")
print (sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), column_names)))
Features sorted by their rank:
[(1, 'day_of_year'), (1, 'drop_day_of_year'), (1, 'drop_hour'), (1, 'drop_month'), (1, 'dropoff_latitude'), (1, 'dropoff_longitude'), (1, 'hour'), (1, 'pickup_longitude'), (2, 'pick_month'), (3, 'drop_day_of_week'), (4, 'day_of_week'), (5, 'pickup_latitude'), (6, 'drop_week_of_year'), (7, 'week_of_year'), (8, 'vendor_id'), (9, 'store_and_fwd_flag'), (10, 'passenger_count')]
In [42]:
booking_max_hour=feature_sel_data.groupby("hour").size()
In [43]:
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True,figsize=(15,10))

train_data.plot(kind='scatter', x='pickup_longitude', y='pickup_latitude',
                color='yellow', 
                s=.02, alpha=.6, subplots=True, ax=ax1)
ax1.set_title("Pickups")
ax1.set_facecolor('black')

train_data.plot(kind='scatter', x='dropoff_longitude', y='dropoff_latitude',
                color='yellow', 
                s=.02, alpha=.6, subplots=True, ax=ax2)
ax2.set_title("Dropoffs")
ax2.set_facecolor('black') 
In [44]:
import folium 
newyork_map = folium.Map(location=[40.767937,-73.982155 ],tiles='Mapbox Bright',
 zoom_start=10)
for row in train_data[:2000].iterrows():
    folium.RegularPolygonMarker([row[1]['pickup_latitude'],row[1]['pickup_longitude'],row[1]['dropoff_latitude'],row[1]['dropoff_longitude']],
                        radius=3,
                        color='gold',
                        popup=str(row[1]['passenger_count'])+','+str(row[1]['vendor_id']),
                        fill_color='#FD8A6C'
                        ).add_to(newyork_map)
In [107]:
newyork_map
Out[107]:
In [137]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

rgb = np.zeros((3000, 3500, 3), dtype=np.uint8)
rgb[..., 0] = 0
rgb[..., 1] = 0
rgb[..., 2] = 0
train_data['pick_lat_new'] = list(map(int, (train_data['pickup_latitude'] - (40.6000))*10000))
train_data['drop_lat_new'] = list(map(int, (train_data['dropoff_latitude'] - (40.6000))*10000))
train_data['pick_lon_new'] = list(map(int, (train_data['pickup_longitude'] - (-74.050))*10000))
train_data['drop_lon_new'] = list(map(int,(train_data['dropoff_longitude'] - (-74.050))*10000))

summary_plot = pd.DataFrame(train_data.groupby(['pick_lat_new', 'pick_lon_new'])['id'].count())

summary_plot.reset_index(inplace = True)
summary_plot.head(120)
lat_list = summary_plot['pick_lat_new'].unique()

for i in lat_list:
    #print(i)
    lon_list = summary_plot.loc[summary_plot['pick_lat_new']==i]['pick_lon_new'].tolist()
    unit = summary_plot.loc[summary_plot['pick_lat_new']==i]['id'].tolist()
    for j in lon_list:
        #j = int(j)
        a = unit[lon_list.index(j)]
        #print(a)
        if (a//50) >0:
            rgb[i][j][0] = 255
            rgb[i,j, 1] = 255
            rgb[i,j, 2] = 0
        elif (a//10)>0:
            rgb[i,j, 0] = 0
            rgb[i,j, 1] = 255
            rgb[i,j, 2] = 0
        else:
            rgb[i,j, 0] = 255
            rgb[i,j, 1] = 0
            rgb[i,j, 2] = 0
fig, ax = plt.subplots(nrows=1,ncols=1,figsize=(14,20))

ax.imshow(rgb, cmap = 'hot')
ax.set_axis_off() 

From the heatmap kind of image above -

  • Red points signifies that 1-10 trips in the given data have that point as pickup point
  • Green points signifies that more than 10-50 trips in the given data have that point as pickup point
  • Yellow points signifies that more than 50+ trips in the given data have that point as pickup point
In [47]:
feature_sel_data.columns
Out[47]:
Index(['vendor_id', 'passenger_count', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'store_and_fwd_flag',
       'pick_month', 'hour', 'week_of_year', 'day_of_year', 'day_of_week',
       'drop_month', 'drop_hour', 'drop_week_of_year', 'drop_day_of_year',
       'drop_day_of_week', 'trip_duration'],
      dtype='object')
In [48]:
df_pick = feature_sel_data[['pickup_longitude','pickup_latitude']]
In [49]:
df_drop = feature_sel_data[['dropoff_longitude','dropoff_latitude']]
In [50]:
data_for_grouping = feature_sel_data.copy()
In [51]:
saff_pickup=data_for_grouping.groupby(('pickup_longitude','pickup_latitude')).size()
In [52]:
saff_dropoff=data_for_grouping.groupby(('dropoff_longitude','dropoff_latitude')).size()
In [53]:
def haversine_(lat1, lng1, lat2, lng2):
    """function to calculate haversine distance between two co-ordinates"""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    AVG_EARTH_RADIUS = 6371  # in km
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    return(h)

def manhattan_distance_pd(lat1, lng1, lat2, lng2):
    """function to calculate manhattan distance between pick_drop"""
    a = haversine_(lat1, lng1, lat1, lng2)
    b = haversine_(lat1, lng1, lat2, lng1)
    return a + b

import math
def bearing_array(lat1, lng1, lat2, lng2):
    """ function was taken from beluga's notebook as this function works on array
    while my function used to work on individual elements and was noticably slow"""
    AVG_EARTH_RADIUS = 6371  # in km
    lng_delta_rad = np.radians(lng2 - lng1)
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    y = np.sin(lng_delta_rad) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(lng_delta_rad)
    return np.degrees(np.arctan2(y, x))
In [54]:
#train_data = train_df

train_data.loc[:,'hvsine_pick_drop'] = haversine_(train_data['pickup_latitude'].values, train_data['pickup_longitude'].values, train_data['dropoff_latitude'].values, train_data['dropoff_longitude'].values)
train_data.loc[:,'manhtn_pick_drop'] = manhattan_distance_pd(train_data['pickup_latitude'].values, train_data['pickup_longitude'].values, train_data['dropoff_latitude'].values, train_data['dropoff_longitude'].values)
train_data.loc[:,'bearing'] = bearing_array(train_data['pickup_latitude'].values, train_data['pickup_longitude'].values, train_data['dropoff_latitude'].values, train_data['dropoff_longitude'].values)
In [55]:
train_data.head()
Out[55]:
id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag ... drop_week_of_year drop_day_of_year drop_day_of_week pick_lat_new drop_lat_new pick_lon_new drop_lon_new hvsine_pick_drop manhtn_pick_drop bearing
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N ... 11 74 0 1679 1656 678 853 1.498521 1.735433 99.970196
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N ... 23 164 6 1385 1311 695 505 1.805507 2.430506 -117.153768
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N ... 3 19 1 1639 1100 709 446 6.385098 8.203575 -159.680165
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N ... 14 97 2 1199 1067 399 377 1.485498 1.661331 -172.737700
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N ... 12 86 5 1932 1825 769 770 1.188588 1.199457 179.473585

5 rows × 29 columns

In [56]:
model_df=train_data[['day_of_year', 'drop_day_of_year', 'drop_hour', 'drop_month', 
                    'dropoff_latitude','dropoff_longitude', 'hour', 'pickup_longitude','trip_duration']].copy()
In [57]:
model_df.columns
Out[57]:
Index(['day_of_year', 'drop_day_of_year', 'drop_hour', 'drop_month',
       'dropoff_latitude', 'dropoff_longitude', 'hour', 'pickup_longitude',
       'trip_duration'],
      dtype='object')
In [58]:
model_df.to_csv('/Users/as186194/Documents/DOCUMENTS/TRIALS/Kaggle/Kaggle_NYC_Taxi/model_df.csv' , sep=',')
In [59]:
'''from sklearn.model_selection import train_test_split
X = model_df.iloc[:,1:8]
Y = model_df["trip_duration"]
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.3)'''
Out[59]:
'from sklearn.model_selection import train_test_split\nX = model_df.iloc[:,1:8]\nY = model_df["trip_duration"]\nX_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.3)'
In [92]:
models_try=pd.read_csv(INPUT_FOLDER + 'model_df.csv', nrows=5000)
In [93]:
models_try = models_try.drop("Unnamed: 0", axis =1)
In [94]:
models_try.head()
Out[94]:
day_of_year drop_day_of_year drop_hour drop_month dropoff_latitude dropoff_longitude hour pickup_longitude trip_duration
0 74 74 17 3 40.765602 -73.964630 17 -73.982155 455
1 164 164 0 6 40.731152 -73.999481 0 -73.980415 663
2 19 19 12 1 40.710087 -74.005333 11 -73.979027 2124
3 97 97 19 4 40.706718 -74.012268 19 -74.010040 429
4 86 86 13 3 40.782520 -73.972923 13 -73.973053 435
In [63]:
predict = models_try['trip_duration']
In [64]:
predictors = models_try.drop('trip_duration', 1)
In [108]:
clf = RandomForestClassifier(n_jobs=2)

clf.fit(predictors, predict)
clf.score(predictors, predict)
Out[108]:
0.99480000000000002
In [68]:
models_test=test_data[['day_of_year','day_of_year', 'hour', 'pick_month', 
                    'dropoff_latitude','dropoff_longitude', 'hour', 'pickup_longitude']].copy()
In [96]:
target= models_try.trip_duration
In [97]:
models_try = models_try.drop('trip_duration',1)
In [70]:
#prediction = clf.predict(models_test)
In [98]:
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=1000, min_samples_leaf=50, min_samples_split=75)
In [99]:
rf_model.fit(models_try.values, target)
Out[99]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=50,
           min_samples_split=75, min_weight_fraction_leaf=0.0,
           n_estimators=1000, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
In [100]:
predictions=rf_model.predict(models_test.values)
predictions[:5]
Out[100]:
array([  714.24758345,  1326.80513069,   699.8540203 ,  1100.04366198,
         557.71578548])
In [106]:
predictions[len(predictions)-5:]
Out[106]:
array([ 1116.38440187,  1578.95358328,  1839.03486568,  1913.8835153 ,
        1383.56172866])
In [112]:
test_df=test_df[:15000]
In [114]:
len(predictions)
Out[114]:
625134
In [122]:
test_df['trip_duration'] = predictions
In [123]:
Id=test_df['id']
In [128]:
test_df[['id', 'trip_duration']].to_csv( INPUT_FOLDER + 'arnav.csv.gz', index=False, compression='gzip')
In [129]:
test_df['trip_duration'][:5]
Out[129]:
0     714.247583
1    1326.805131
2     699.854020
3    1100.043662
4     557.715785
Name: trip_duration, dtype: float64
In [133]:
test_df.head()
Out[133]:
id vendor_id pickup_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration
0 id3004672 1 2016-06-30 23:59:58 1 -73.988129 40.732029 -73.990173 40.756680 N 714.247583
1 id3505355 1 2016-06-30 23:59:53 1 -73.964203 40.679993 -73.959808 40.655403 N 1326.805131
2 id1217141 1 2016-06-30 23:59:47 1 -73.997437 40.737583 -73.986160 40.729523 N 699.854020
3 id2150126 2 2016-06-30 23:59:41 1 -73.956070 40.771900 -73.986427 40.730469 N 1100.043662
4 id1598245 1 2016-06-30 23:59:33 1 -73.970215 40.761475 -73.961510 40.755890 N 557.715785